Abstract

For supervised and unsupervised learning, positive definite kernels allow the use of large and
potentially infinite-dimensional feature spaces with a computational cost that only depends on
the number of observations. This is usually done through the penalization of predictor functions
by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing
norms such as the $\ell_1$-norm or the block $\ell_1$-norm. We assume that the kernel decomposes into
a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we
show that it is then possible to perform kernel selection through a hierarchical multiple kernel
learning framework, in polynomial time in the number of selected kernels. This framework
is naturally applied to nonlinear variable selection; our extensive simulations on synthetic
datasets and datasets from the UCI repository show that efficiently exploring the large feature
space through sparsity-inducing norms leads to state-of-the-art predictive performance.

1 Introduction 

In the last two decades, kernel methods have been a prolific theoretical and algorithmic machine
learning framework. By using appropriate regularization by Hilbertian norms, representer theorems
make it possible to consider large and potentially infinite-dimensional feature spaces while working
within an implicit feature space no larger than the number of observations. This has led to numerous
works on kernel design adapted to specific data types and generic kernel-based algorithms for many
learning tasks (see, e.g., [1, 2]).

Regularization by sparsity-inducing norms, such as the $\ell_1$-norm, has also attracted a lot of
interest in recent years. While early work has focused on efficient algorithms to solve the convex
optimization problems, recent research has looked at the model selection properties and predictive
performance of such methods, in the linear case [3] or within the multiple kernel learning
framework [4].

In this paper, we aim to bridge the gap between these two lines of research by trying to use
$\ell_1$-norms inside the feature space. Indeed, feature spaces are large and we expect the estimated
predictor function to require only a small number of features, which is exactly the situation where
$\ell_1$-norms have proven advantageous. This leads to two natural questions that we try to answer in
this paper: (1) Is it feasible to perform optimization in this very large feature space with a cost
that is polynomial in the size of the input space? (2) Does it lead to better predictive performance
and feature selection?

More precisely, we consider a positive definite kernel that can be expressed as a large sum of
positive definite basis or local kernels. This exactly corresponds to the situation where a large
feature space is the concatenation of smaller feature spaces, and we aim to do selection among these
many kernels, which may be done through multiple kernel learning [5]. One major difficulty, however,
is that the number of these smaller kernels is usually exponential in the dimension of the input
space, and applying multiple kernel learning directly to this decomposition would be intractable.

In order to perform selection efficiently, we make the extra assumption that these small kernels
can be embedded in a directed acyclic graph (DAG). Following [6, 7], we consider in Section 2
a specific combination of $\ell_2$-norms that is adapted to the DAG, and will restrict the authorized
sparsity patterns; in our specific kernel framework, we are able to use the DAG to design an
optimization algorithm which has polynomial complexity in the number of selected kernels (Section 3).
In simulations (Section 5), we focus on directed grids, where our framework allows us to perform
nonlinear variable selection. We provide extensive experimental validation of our novel
regularization framework; in particular, we compare it to the regular $\ell_2$-regularization and show
that it is always competitive and often leads to better performance, both on synthetic examples and
on standard regression and classification datasets from the UCI repository.

Finally, we extend in Section 4 some of the known consistency results of the Lasso and multiple
kernel learning [3, 4], and give a partial answer to the model selection capabilities of our
regularization framework by giving necessary and sufficient conditions for model consistency. In
particular, we show that our framework is adapted to estimating consistently only the hull of the
relevant variables. Hence, by restricting the statistical power of our method, we gain computational
efficiency.

2 Hierarchical multiple kernel learning (HKL) 

We consider the problem of predicting a random variable $Y \in \mathcal{Y} \subset \mathbb{R}$ from a random variable
$X \in \mathcal{X}$, where $\mathcal{X}$ and $\mathcal{Y}$ may be quite general spaces. We assume that we are given $n$ i.i.d.
observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$. We define the empirical risk of a function $f$
from $\mathcal{X}$ to $\mathbb{R}$ as $\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))$, where $\ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$ is a loss function. We only
assume that $\ell$ is convex with respect to the second parameter (but not necessarily differentiable).
Typical examples of loss functions are the square loss for regression, i.e.,
$\ell(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$ for $y \in \mathbb{R}$, and the logistic loss $\ell(y, \hat{y}) = \log(1 + e^{-y\hat{y}})$ or the hinge loss
$\ell(y, \hat{y}) = \max\{0, 1 - y\hat{y}\}$ for binary classification, where $y \in \{-1, 1\}$, leading respectively
to logistic regression and support vector machines. Other losses
may be used for other settings (see, e.g., [2] or the Appendix).
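For concreteness, the three losses and the empirical risk can be written in a few lines of Python. This is only an illustrative sketch: the function names are ours, and the logistic loss is evaluated in a numerically stable form that is algebraically equal to $\log(1 + e^{-y\hat{y}})$.

```python
import math


def square_loss(y, y_hat):
    """Square loss 0.5 * (y - y_hat)^2 for regression."""
    return 0.5 * (y - y_hat) ** 2


def logistic_loss(y, y_hat):
    """Logistic loss log(1 + exp(-y * y_hat)) with y in {-1, 1}, computed stably."""
    m = -y * y_hat
    # log(1 + e^m) = max(m, 0) + log(1 + e^{-|m|})
    return max(m, 0.0) + math.log1p(math.exp(-abs(m)))


def hinge_loss(y, y_hat):
    """Hinge loss max(0, 1 - y * y_hat) with y in {-1, 1}."""
    return max(0.0, 1.0 - y * y_hat)


def empirical_risk(loss, ys, predictions):
    """Empirical risk (1/n) * sum_i loss(y_i, f(x_i))."""
    return sum(loss(y, f) for y, f in zip(ys, predictions)) / len(ys)
```

All three functions are convex in their second argument, which is the only property used in the sequel.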

2.1 Graph-structured positive definite kernels 

We assume that we are given a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and that this kernel can
be expressed as the sum, over an index set $V$, of basis kernels $k_v$, $v \in V$, i.e., for all $x, x' \in \mathcal{X}$,
$k(x, x') = \sum_{v \in V} k_v(x, x')$. For each $v \in V$, we denote by $\mathcal{F}_v$ and $\Phi_v$ the feature space and feature
map of $k_v$, i.e., for all $x, x' \in \mathcal{X}$, $k_v(x, x') = \langle \Phi_v(x), \Phi_v(x') \rangle$. Throughout the paper, we denote
by $\|u\|$ the Hilbertian norm of $u$ and by $\langle u, v \rangle$ the associated dot product, where the precise space
is omitted and can always be inferred from the context.

Figure 1: Example of graph and associated notions. (Left) Example of a 2D-grid. (Middle) Example
of sparsity pattern (× in light blue) and the complement of its hull (+ in light red). (Right) Dark
blue points (×) are extreme points of the set of all active points (blue ×); dark red points (+) are
the sources of the set of all red points (+).

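Our sum assumption corresponds to a situation where the feature map $\Phi(x)$ and feature space $\mathcal{F}$
for $k$ is the concatenation of the feature maps $\Phi_v(x)$ for each kernel $k_v$, i.e., $\mathcal{F} = \prod_{v \in V} \mathcal{F}_v$ and
$\Phi(x) = (\Phi_v(x))_{v \in V}$. Thus, looking for a certain $\beta \in \mathcal{F}$ and a predictor function $f(x) = \langle \beta, \Phi(x) \rangle$
is equivalent to looking jointly for $\beta_v \in \mathcal{F}_v$, for all $v \in V$, and $f(x) = \sum_{v \in V} \langle \beta_v, \Phi_v(x) \rangle$.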

As mentioned earlier, we make the assumption that the set $V$ can be embedded into a directed
acyclic graph. Directed acyclic graphs (referred to as DAGs) allow us to naturally define the notions
of parents, children, descendants and ancestors. Given a node $w \in V$, we denote by $A(w) \subset V$ the
set of its ancestors, and by $D(w) \subset V$ the set of its descendants. We use the convention that any
$w$ is a descendant and an ancestor of itself, i.e., $w \in A(w)$ and $w \in D(w)$. Moreover, for $W \subset V$,
we denote by $\mathrm{sources}(W)$ the set of sources of the graph $G$ restricted to $W$ (i.e., nodes in $W$ with
no parents belonging to $W$). Given a subset of nodes $W \subset V$, we can define the hull of $W$ as the
union of all ancestors of $w \in W$, i.e., $\mathrm{hull}(W) = \bigcup_{w \in W} A(w)$. Given a set $W$, we define the set
of extreme points of $W$ as the smallest subset $T \subset W$ such that $\mathrm{hull}(T) = \mathrm{hull}(W)$ (note that it is
always well defined, as $\bigcap_{T \subset V,\ \mathrm{hull}(T) = \mathrm{hull}(W)} T$). See Figure 1 for examples of these notions.
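These notions are easy to make concrete on a directed grid, where $v$ is an ancestor of $w$ exactly when $v \leq w$ coordinate-wise. The following Python sketch is ours (function names and the tuple encoding of vertices are assumptions made for illustration, not part of the method itself).

```python
from itertools import product


def grid_vertices(q, p=2):
    """All vertices (j_1, ..., j_p) of the directed grid {0, ..., q}^p."""
    return set(product(range(q + 1), repeat=p))


def ancestors(w):
    """A(w): vertices v with v <= w coordinate-wise (w included)."""
    return set(product(*(range(j + 1) for j in w)))


def descendants(w, q):
    """D(w): vertices v with v >= w coordinate-wise (w included)."""
    return set(product(*(range(j, q + 1) for j in w)))


def parents(w):
    """Direct parents of w: decrease exactly one coordinate by one."""
    return {w[:i] + (w[i] - 1,) + w[i + 1:] for i in range(len(w)) if w[i] > 0}


def sources(W):
    """Nodes of W with no parent belonging to W."""
    W = set(W)
    return {w for w in W if not (parents(w) & W)}


def hull(W):
    """Union of the ancestor sets of all elements of W."""
    out = set()
    for w in W:
        out |= ancestors(w)
    return out


def extreme_points(W):
    """Smallest T in W with hull(T) = hull(W): the maximal elements of W."""
    W = set(W)
    return {w for w in W if not any(u != w and w in ancestors(u) for u in W)}
```

For instance, on the grid with $q = 2$ and $p = 2$, the hull of $\{(1,0), (0,2)\}$ is $\{(0,0), (1,0), (0,1), (0,2)\}$, its extreme points are the two original vertices, and the sources of the complement of the hull are $(1,1)$ and $(2,0)$.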

The goal of this paper is to perform kernel selection among the kernels $k_v$, $v \in V$. We essentially
use the graph to limit the search to specific subsets of $V$. Namely, instead of considering all possible
subsets of active (relevant) vertices, we are only interested in estimating correctly the hull of these
relevant vertices; in Section 2.2, we design a specific sparsity-inducing norm adapted to hulls.

In this paper, we primarily focus on kernels that can be expressed as "products of sums", and on
the associated $p$-dimensional directed grids, while noting that our framework is applicable to many
other kernels. Namely, we assume that the input space $\mathcal{X}$ factorizes into $p$ components
$\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_p$ and that we are given $p$ sequences of length $q + 1$ of kernels $k_{ij}(x_i, x_i')$,
$i \in \{1, \dots, p\}$, $j \in \{0, \dots, q\}$, such that

$$k(x, x') = \sum_{j_1, \dots, j_p = 0}^{q} \prod_{i=1}^{p} k_{i j_i}(x_i, x_i') = \prod_{i=1}^{p} \Big( \sum_{j_i = 0}^{q} k_{i j_i}(x_i, x_i') \Big).$$

We thus have a sum of $(q+1)^p$ kernels, which can be computed efficiently as a product of $p$ sums. A natural
DAG on $V = \prod_{i=1}^{p} \{0, \dots, q\}$ is defined by connecting each $(j_1, \dots, j_p)$ to $(j_1 + 1, j_2, \dots, j_p)$,
$\dots$, $(j_1, \dots, j_{p-1}, j_p + 1)$. As shown in Section 2.2, this DAG will correspond to the constraint
of selecting a given product of kernels only after all the subproducts are selected. Those DAGs
are especially suited to nonlinear variable selection, in particular with the polynomial and Gaussian
kernels. In this context, products of kernels correspond to interactions between certain variables, and
our DAG implies that we select an interaction only after all sub-interactions have already been selected.
Polynomial kernels We consider $\mathcal{X}_i = \mathbb{R}$, $k_{ij}(x_i, x_i') = \binom{q}{j} (x_i x_i')^j$; the full kernel is then
equal to $k(x, x') = \prod_{i=1}^{p} \sum_{j=0}^{q} \binom{q}{j} (x_i x_i')^j = \prod_{i=1}^{p} (1 + x_i x_i')^q$. Note that this is not exactly
the usual polynomial kernel (whose feature space is the space of multivariate polynomials of total
degree at most $q$), since our kernel considers polynomials of degree at most $q$ in each variable.
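The "product of sums" identity can be checked numerically on a small example. The sketch below is ours (function names are illustrative); it compares the explicit sum over all $(q+1)^p$ basis kernels, taken with the binomial weights $\binom{q}{j}$, against the closed-form product $\prod_i (1 + x_i x_i')^q$.

```python
from itertools import product
from math import comb, prod


def basis_kernel(j, q, xi, xi_prime):
    """One-dimensional basis kernel binom(q, j) * (x_i * x_i')^j."""
    return comb(q, j) * (xi * xi_prime) ** j


def kernel_as_sum(x, x_prime, q):
    """Sum over all (q + 1)^p multi-indices of products of basis kernels."""
    p = len(x)
    total = 0.0
    for js in product(range(q + 1), repeat=p):
        total += prod(basis_kernel(j, q, x[i], x_prime[i]) for i, j in enumerate(js))
    return total


def kernel_as_product(x, x_prime, q):
    """Same kernel computed as a product of p one-dimensional sums."""
    return prod((1.0 + xi * xpi) ** q for xi, xpi in zip(x, x_prime))
```

The first function costs $(q+1)^p$ operations and the second only $p$, which is what makes the full kernel usable even when $V$ is far too large to enumerate.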

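Gaussian kernels We also consider $\mathcal{X}_i = \mathbb{R}$, and the Gaussian-RBF kernel $e^{-b(x - x')^2}$. The
following decomposition is the eigendecomposition of the non-centered covariance operator for a
normal distribution with variance $1/4a$ (see, e.g., [8]):

$$e^{-b(x - x')^2} = \Big(1 - \frac{b^2}{A^2}\Big)^{1/2} \sum_{k=0}^{\infty} \frac{(b/A)^k}{2^k k!} \Big[ e^{-\frac{b}{A}(a + c) x^2} H_k(\sqrt{2c}\, x) \Big] \Big[ e^{-\frac{b}{A}(a + c) (x')^2} H_k(\sqrt{2c}\, x') \Big],$$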

where $c^2 = a^2 + 2ab$, $A = a + b + c$, and $H_k$ is the $k$-th Hermite polynomial. By appropriately
truncating the sum, i.e., by considering that the first $q$ basis kernels are obtained from the first $q$
single Hermite polynomials, and the $(q + 1)$-th kernel is summing over all other kernels, we obtain
a decomposition of a uni-dimensional Gaussian kernel into $q + 1$ components ($q$ of them are
one-dimensional, the last one is infinite-dimensional, but can be computed by differencing). The
decomposition ends up being close to a polynomial kernel of infinite degree, modulated by an
exponential [2]. One may also use an adaptive decomposition using kernel PCA (see, e.g., [2, 1]),
which is equivalent to using the eigenvectors of the empirical covariance operator associated with
the data (and not the population one associated with the Gaussian distribution with the same variance).
In simulations, we tried both, with no significant differences.
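A truncated Hermite expansion of the one-dimensional Gaussian kernel can be checked numerically. The sketch below is ours and rests on two stated assumptions: we use Mehler's formula for Hermite polynomials, which yields the normalization factor $(1 - b^2/A^2)^{1/2}$ in front of the series, and we work with the rescaled polynomials $h_k = H_k / \sqrt{2^k k!}$ to avoid overflow. Function names are illustrative.

```python
import math


def normalized_hermite(k_max, u):
    """Values h_k(u) = H_k(u) / sqrt(2^k k!) for k = 0..k_max (stable recurrence)."""
    h = [1.0]
    if k_max >= 1:
        h.append(math.sqrt(2.0) * u)
    for k in range(1, k_max):
        h.append(math.sqrt(2.0 / (k + 1)) * u * h[k] - math.sqrt(k / (k + 1)) * h[k - 1])
    return h


def gaussian_kernel_truncated(x, xp, a, b, k_max):
    """Hermite expansion of exp(-b (x - xp)^2), truncated after k_max + 1 terms."""
    c = math.sqrt(a * a + 2.0 * a * b)
    A = a + b + c
    rho = b / A
    hx = normalized_hermite(k_max, math.sqrt(2.0 * c) * x)
    hxp = normalized_hermite(k_max, math.sqrt(2.0 * c) * xp)
    envelope = math.exp(-rho * (a + c) * (x * x + xp * xp))
    # (b/A)^k / (2^k k!) * H_k(u) H_k(v) equals rho^k * h_k(u) * h_k(v)
    series = sum(rho ** k * hx[k] * hxp[k] for k in range(k_max + 1))
    return math.sqrt(1.0 - rho * rho) * envelope * series
```

Since $b/A < 1$, the terms decay geometrically, so that a moderate truncation level already reproduces the kernel to high accuracy; the remainder of the series is what the last, infinite-dimensional component collects.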

Finally, by taking the product over all variables, we obtain a decomposition of the $p$-dimensional
Gaussian kernel into $(q + 1)^p$ components, which are adapted to nonlinear variable selection. Note
that for $q = 1$, we obtain ANOVA-like decompositions [2].

Kernels or features? In this paper, we emphasize the kernel view, i.e., we are given a kernel
(and thus a feature space) and we explore it using $\ell_1$-norms. Alternatively, we could use the feature
view, i.e., we have a large structured set of features that we try to select from; however, the
techniques developed in this paper assume that (a) each feature might be infinite-dimensional and (b)
that we can sum all the local kernels efficiently (see in particular Section 3.2). Following the kernel
view thus seems slightly more natural.

2.2 Graph-based structured regularization 

Given $\beta \in \prod_{v \in V} \mathcal{F}_v$, the natural Hilbertian norm $\|\beta\|$ is defined through $\|\beta\|^2 = \sum_{v \in V} \|\beta_v\|^2$.
Penalizing with this norm is efficient because summing all kernels $k_v$ is assumed feasible in
polynomial time and we can bring to bear the usual kernel machinery; however, it does not lead to sparse
solutions, where many $\beta_v$ will be exactly equal to zero.

As said earlier, we are only interested in the hull of the selected elements $\beta_v \in \mathcal{F}_v$, $v \in V$; the
hull of a set $I$ is characterized by the set of $v$ such that $D(v) \subset I^c$, i.e., such that all descendants of
$v$ are in the complement $I^c$: $\mathrm{hull}(I) = \{v \in V,\ D(v) \subset I^c\}^c$. Thus, if we try to estimate $\mathrm{hull}(I)$,
we need to determine which $v \in V$ are such that $D(v) \subset I^c$. In our context, we are hence looking
at selecting vertices $v \in V$ for which $\beta_{D(v)} = (\beta_w)_{w \in D(v)} = 0$.

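We thus consider the following structured block $\ell_1$-norm defined as

$$\sum_{v \in V} d_v \|\beta_{D(v)}\| = \sum_{v \in V} d_v \Big( \sum_{w \in D(v)} \|\beta_w\|^2 \Big)^{1/2},$$

where $(d_v)_{v \in V}$ are positive weights. Penalizing by such a norm will indeed impose that some of
the vectors $\beta_{D(v)} \in \prod_{w \in D(v)} \mathcal{F}_w$ are exactly zero. We thus consider the following minimization
problem¹:

$$\min_{\beta \in \prod_{v \in V} \mathcal{F}_v} \; \frac{1}{n} \sum_{i=1}^{n} \ell\Big( y_i, \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle \Big) + \frac{\lambda}{2} \Big( \sum_{v \in V} d_v \|\beta_{D(v)}\| \Big)^2. \quad (1)$$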
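To make the structured norm concrete, the following Python sketch (ours; the dictionary-based encoding and function names are illustrative assumptions) evaluates $\sum_v d_v \|\beta_{D(v)}\|$ on a directed grid, where each $\beta_v$ is stored as a list of coordinates.

```python
import math
from itertools import product


def descendants(v, q):
    """D(v) in the directed grid {0, ..., q}^p: all w >= v coordinate-wise."""
    return product(*(range(j, q + 1) for j in v))


def structured_norm(beta, d, q):
    """Sum over v of d_v * ||beta_{D(v)}||.

    beta: dict mapping each vertex to a list of floats (its feature block)
    d:    dict mapping each vertex to a positive weight
    """
    total = 0.0
    for v in beta:
        squared = sum(c * c for w in descendants(v, q) for c in beta[w])
        total += d[v] * math.sqrt(squared)
    return total
```

On the $2 \times 2$ grid ($q = 1$, $p = 2$) with unit weights, $\beta_{(0,0)} = 3$, $\beta_{(1,0)} = 4$ and the other blocks equal to zero, the groups rooted at $(0,1)$ and $(1,1)$ vanish and the norm equals $5 + 4 = 9$; the vertices whose group does not vanish form the hull of the support.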

Our Hilbertian norm is a Hilbert space instantiation of the hierarchical norms recently introduced
by [6]. If all Hilbert spaces are finite-dimensional, our particular choice of norms corresponds to an
"$\ell_1$-norm of $\ell_2$-norms". While with uni-dimensional groups or kernels, the "$\ell_1$-norm of $\ell_\infty$-norms"
allows an efficient path algorithm for the square loss and when the DAG is a tree [6], this is not
possible anymore with groups of size larger than one, or when the DAG is not a tree. In Section 3,
we propose a novel algorithm to solve the associated optimization problem in time polynomial in the
number of selected groups or kernels, for all group sizes, DAGs and losses. Moreover, in Section 4,
we show under which conditions a solution to the problem in Eq. (1) consistently estimates the hull
of the sparsity pattern.

Finally, note that in certain settings (finite-dimensional Hilbert spaces and distributions with
absolutely continuous densities), these norms have the effect of selecting a given kernel only after
all of its ancestors [6]. This is another explanation of why hulls end up being selected, since to include
a given vertex in the model, the entire set of its ancestors must also be selected.

3 Optimization problem 

In this section, we give optimality conditions for the problem in Eq. (1), as well as optimization
algorithms with polynomial time complexity in the number of selected kernels. In simulations we
consider total numbers of kernels larger than $10^{30}$, and thus such efficient algorithms are essential
to the success of hierarchical multiple kernel learning (HKL).

3.1 Reformulation in terms of multiple kernel learning 

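Following [9, 10], we can simply derive an equivalent formulation of Eq. (1). Using the Cauchy-Schwarz
inequality, we have that for all $\eta \in \mathbb{R}^V$ such that $\eta \geq 0$ and $\sum_{v \in V} d_v^2 \eta_v \leq 1$,

$$\Big( \sum_{v \in V} d_v \|\beta_{D(v)}\| \Big)^2 \leq \sum_{v \in V} \frac{\|\beta_{D(v)}\|^2}{\eta_v} = \sum_{w \in V} \Big( \sum_{v \in A(w)} \eta_v^{-1} \Big) \|\beta_w\|^2,$$

with equality if and only if $\eta_v = d_v^{-1} \|\beta_{D(v)}\| \big( \sum_{v \in V} d_v \|\beta_{D(v)}\| \big)^{-1}$. We associate to the vector
$\eta \in \mathbb{R}_+^V$ the vector $\zeta \in \mathbb{R}_+^V$ such that $\forall w \in V$, $\zeta_w^{-1} = \sum_{v \in A(w)} \eta_v^{-1}$. We use the natural convention
that if $\eta_v$ is equal to zero, then $\zeta_w$ is equal to zero for all descendants $w$ of $v$. We denote by $H$
the set of allowed $\eta$ and by $Z$ the set of all associated $\zeta$. The sets $H$ and $Z$ are in bijection, and we
can interchangeably use $\eta \in H$ or the corresponding $\zeta(\eta) \in Z$. Note that $Z$ is in general not
convex (unless the DAG is a tree, see the Appendix), and if $\zeta \in Z$, then $\zeta_w \leq \zeta_v$ for all $w \in D(v)$,
i.e., weights of descendant kernels are smaller, which is consistent with the known fact that kernels
should always be selected after all their ancestors.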
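The variational bound and the map $\eta \mapsto \zeta$ can be verified on a toy DAG. The sketch below is ours (scalar blocks $\beta_w$, explicit ancestor and descendant lists, and all function names are assumptions made for illustration): at the optimal $\eta$ the weighted squared norm equals the squared structured norm, and any other feasible $\eta$ gives a larger value.

```python
import math


def zeta_from_eta(eta, ancestors):
    """zeta_w^{-1} = sum over v in A(w) of eta_v^{-1}; zero if some ancestor has eta_v = 0."""
    zeta = {}
    for w, anc in ancestors.items():
        if any(eta[v] == 0.0 for v in anc):
            zeta[w] = 0.0
        else:
            zeta[w] = 1.0 / sum(1.0 / eta[v] for v in anc)
    return zeta


def group_norm(beta, group):
    """Euclidean norm of the scalar blocks beta_w, w in group."""
    return math.sqrt(sum(beta[w] ** 2 for w in group))


def structured_norm(beta, d, descendants):
    """Sum over v of d_v * ||beta_{D(v)}||."""
    return sum(d[v] * group_norm(beta, descendants[v]) for v in beta)


def optimal_eta(beta, d, descendants):
    """Equality case of Cauchy-Schwarz: eta_v = ||beta_{D(v)}|| / (d_v * Omega)."""
    omega = structured_norm(beta, d, descendants)
    return {v: group_norm(beta, descendants[v]) / (d[v] * omega) for v in beta}


def weighted_squared_norm(beta, zeta):
    """Sum over w of zeta_w^{-1} * beta_w^2 (infinite if zeta_w = 0 but beta_w != 0)."""
    total = 0.0
    for w, b in beta.items():
        if b == 0.0:
            continue
        if zeta[w] == 0.0:
            return math.inf
        total += b * b / zeta[w]
    return total
```

On a chain $0 \to 1 \to 2$, for example, one can check that $\zeta_0 \geq \zeta_1 \geq \zeta_2$ for any positive $\eta$, in line with the monotonicity property stated above.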

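The problem in Eq. (1) is thus equivalent to

$$\min_{\eta \in H} \; \min_{\beta \in \prod_{v \in V} \mathcal{F}_v} \; \frac{1}{n} \sum_{i=1}^{n} \ell\Big( y_i, \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle \Big) + \frac{\lambda}{2} \sum_{w \in V} \zeta_w(\eta)^{-1} \|\beta_w\|^2. \quad (2)$$

Using the change of variable $\tilde{\beta}_v = \beta_v \zeta_v^{-1/2}$ and $\tilde{\Phi}(x) = (\zeta_v^{1/2} \Phi_v(x))_{v \in V}$, this implies that
given the optimal $\eta$ (and associated $\zeta$), $\beta$ corresponds to the solution of the regular supervised
learning problem with kernel matrix $K = \sum_{w \in V} \zeta_w K_w$, where $K_w$ is the $n \times n$ kernel matrix
associated with kernel $k_w$. Moreover, the solution is then $\beta_w = \zeta_w \sum_{i=1}^{n} \alpha_i \Phi_w(x_i)$, where $\alpha \in \mathbb{R}^n$
are the dual parameters associated with the single kernel learning problem.

¹Following [5], we consider the square of the norm, which does not change the regularization properties, but allows
simple links with multiple kernel learning.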

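Thus, the solution is entirely determined by $\alpha \in \mathbb{R}^n$ and $\eta \in \mathbb{R}^V$ (and its corresponding
$\zeta \in \mathbb{R}^V$). More precisely, we have (see proof in the Appendix):

Proposition 1 The pair $(\alpha, \eta)$ is optimal for Eq. (1), with $\forall w$, $\beta_w = \zeta_w \sum_{i=1}^{n} \alpha_i \Phi_w(x_i)$, if and
only if (a) given $\eta$, $\alpha$ is optimal for the single kernel learning problem with kernel matrix
$K = \sum_{w \in V} \zeta_w(\eta) K_w$, and (b) given $\alpha$, $\eta \in H$ maximizes

$$\sum_{w \in V} \Big( \sum_{v \in A(w)} \eta_v^{-1} \Big)^{-1} \alpha^\top K_w \alpha.$$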

Moreover, the total duality gap can be upper bounded by the sum of the two separate duality gaps for
the two optimization problems, which will be useful in Section 3.2 (see Appendix for more details).
Note that in the case of "flat" regular multiple kernel learning, where the DAG has no edges, we
recover the usual optimality conditions [9, 10].

Following a common practice for convex sparsity problems [11], we will try to solve a small
problem where we assume we know the set of $v$ such that $\|\beta_{D(v)}\|$ is equal to zero (Section 3.3).
We then "simply" need to check that variables in that set may indeed be left out of the solution. In
the next section, we show that this can be done in polynomial time although the number of kernels
to consider leaving out is exponential (Section 3.2).

3.2 Conditions for global optimality of reduced problem 

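We denote by $J$ the complement of the set of norms which are set to zero. We thus consider the
optimal solution $\beta$ of the reduced problem (on $J$), namely,

$$\min_{\beta_J \in \prod_{v \in J} \mathcal{F}_v} \; \frac{1}{n} \sum_{i=1}^{n} \ell\Big( y_i, \sum_{v \in J} \langle \beta_v, \Phi_v(x_i) \rangle \Big) + \frac{\lambda}{2} \Big( \sum_{v \in J} d_v \|\beta_{D(v) \cap J}\| \Big)^2, \quad (3)$$

with optimal primal variables $\beta_J$, dual variables $\alpha$ and optimal pair $(\eta_J, \zeta_J)$. We now consider
necessary conditions and sufficient conditions for this solution (augmented with zeros for non-active
variables, i.e., variables in $J^c$) to be optimal with respect to the full problem in Eq. (1). We denote
by $\Omega = \sum_{v \in J} d_v \|\beta_{D(v) \cap J}\|$ the optimal value of the norm for the reduced problem.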

Proposition 2 ($N_J$) If the reduced solution is optimal for the full problem in Eq. (1) and all kernels
in the extreme points of $J$ are active, then we have

$$\max_{t \in \mathrm{sources}(J^c)} \frac{\alpha^\top K_t \alpha}{d_t^2} \leq \Omega^2.$$

Proposition 3 ($S_{J,\varepsilon}$) If

$$\max_{t \in \mathrm{sources}(J^c)} \sum_{w \in D(t)} \frac{\alpha^\top K_w \alpha}{\big( \sum_{v \in A(w) \cap D(t)} d_v \big)^2} \leq \Omega^2 + \varepsilon / \lambda,$$

then the total duality gap is less than $\varepsilon$.

The proof is fairly technical and can be found in the Appendix; this result constitutes the main
technical contribution of the paper: it essentially makes it possible to solve a very large optimization
problem over exponentially many dimensions in polynomial time.

The necessary condition ($N_J$) does not cause any computational problems. However, the sufficient
condition ($S_{J,\varepsilon}$) requires summing over all descendants of the active kernels, which is impossible
in practice (as shown in Section 5, we consider $V$ of cardinality often greater than $10^{30}$). Here, we need
to bring to bear the specific structure of the kernel $k$. In the context of directed grids we consider in
this paper, if $d_v$ can also be decomposed as a product, then $\sum_{v \in A(w) \cap D(t)} d_v$ is also factorized, and
we can compute the sum over all $w \in D(t)$ in linear time in $p$. Moreover, we can cache the sums
$\sum_{w \in D(t)} K_w / \big( \sum_{v \in A(w) \cap D(t)} d_v \big)^2$ in order to save running time.
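The factorization argument can be illustrated on a small grid. In the sketch below (ours), we assume product-form weights $d_v = \prod_i d_{i, v_i}$ and, for simplicity, replace each kernel matrix by a single kernel value $k_w = \prod_i k_{i, w_i}$; the names `kfac` and `dfac` for the per-dimension factors are hypothetical. In the grid, $A(w) \cap D(t)$ is the box $\{v : t \leq v \leq w\}$, so both the inner and the outer sums split across dimensions.

```python
from itertools import product
from math import prod


def brute_force(t, q, kfac, dfac):
    """Sum over w in D(t) of k_w / (sum over v in A(w) & D(t) of d_v)^2, by enumeration."""
    p = len(t)
    total = 0.0
    for w in product(*(range(t[i], q + 1) for i in range(p))):
        k_w = prod(kfac[i][w[i]] for i in range(p))
        denom = sum(
            prod(dfac[i][v[i]] for i in range(p))
            for v in product(*(range(t[i], w[i] + 1) for i in range(p)))
        )
        total += k_w / denom ** 2
    return total


def factorized(t, q, kfac, dfac):
    """Same quantity as a product of p one-dimensional sums (linear in p)."""
    result = 1.0
    for i in range(len(t)):
        partial, cumulative_d = 0.0, 0.0
        for j in range(t[i], q + 1):
            cumulative_d += dfac[i][j]  # sum of d_{i, j'} for t_i <= j' <= j
            partial += kfac[i][j] / cumulative_d ** 2
        result *= partial
    return result
```

With kernel matrices instead of scalars, the same computation applies entry-wise, the product over dimensions becoming an element-wise product of $n \times n$ matrices.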

3.3 Dual optimization for reduced or small problems 

When kernels $k_v$, $v \in V$ have low-dimensional feature spaces, we may use a primal representation
and solve the problem in Eq. (1) using generic optimization toolboxes adapted to conic constraints
(see, e.g., [12]). However, in order to reuse existing optimized supervised learning code and use
high-dimensional kernels, it is preferable to use a dual optimization. Namely, we use the same
technique as [9]: we consider, for $\zeta \in Z$, the function

$$B(\zeta) = \min_{\beta \in \prod_{v \in V} \mathcal{F}_v} \; \frac{1}{n} \sum_{i=1}^{n} \ell\Big( y_i, \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle \Big) + \frac{\lambda}{2} \sum_{w \in V} \zeta_w^{-1} \|\beta_w\|^2,$$

which is the optimal value of the single kernel learning problem with kernel
matrix $\sum_{w \in V} \zeta_w K_w$. Solving Eq. (2) is equivalent to minimizing $B(\zeta(\eta))$ with respect to $\eta \in H$.

If a ridge (i.e., positive diagonal) is added to the kernel matrices, the function $B$ is differentiable.
Moreover, the function $\eta \mapsto \zeta(\eta)$ is differentiable on $(\mathbb{R}_+^*)^V$. Thus, the function
$\eta \mapsto B[\zeta((1 - \varepsilon)\eta + \frac{\varepsilon}{|V|} d^{-2})]$, where $d^{-2}$ is the vector with elements $d_v^{-2}$, is differentiable if $\varepsilon > 0$.
We can then use the same projected gradient descent strategy as [9] to minimize it. The overall
complexity of the algorithm is then proportional to $O(|V| n^2)$ (to form the kernel matrices) plus the
complexity of solving a single kernel learning problem (typically between $O(n^2)$ and $O(n^3)$).

3.4 Kernel search algorithm 

We are now ready to present the detailed algorithm, which extends the feature search algorithm
of [11]. Note that the kernel matrices are never all needed explicitly, i.e., we only need them (a)
explicitly to solve the small problems (but we need only a few of those) and (b) implicitly to compute
the sufficient condition ($S_{J,\varepsilon}$), which requires summing over all kernels, as shown in Section 3.2.

• Input: kernel matrices $K_v \in \mathbb{R}^{n \times n}$, $v \in V$, maximal gap $\varepsilon$, maximal # of kernels $Q$

• Algorithm

  1. Initialization: set $J = \mathrm{sources}(V)$,
     compute $(\alpha, \eta)$ solutions of Eq. (3), obtained using Section 3.3

  2. while ($N_J$) and ($S_{J,\varepsilon}$) are not satisfied and $\#(J) \leq Q$

     - If ($N_J$) is not satisfied, add violating variables in $\mathrm{sources}(J^c)$ to $J$
       else, add violating variables in $\mathrm{sources}(J^c)$ of ($S_{J,\varepsilon}$) to $J$

     - Recompute $(\alpha, \eta)$ optimal solutions of Eq. (3)

• Output: $J$, $\alpha$, $\eta$
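The control flow of the search can be sketched as follows. This is only a skeleton under stated assumptions: the three callables `solve_reduced`, `violators_necessary` and `violators_sufficient` are hypothetical placeholders for, respectively, the reduced solver of Section 3.3 and the checks of the necessary and sufficient conditions of Section 3.2; each check is assumed to return the set of violating vertices among the sources of the complement of $J$ (empty when the condition holds).

```python
def kernel_search(initial_sources, solve_reduced, violators_necessary,
                  violators_sufficient, max_kernels):
    """Active-set search over a DAG of kernels (control flow only).

    initial_sources:      iterable with the sources of V
    solve_reduced:        J -> (alpha, eta), solution of the reduced problem on J
    violators_necessary:  (J, alpha, eta) -> set of sources of J^c violating (N_J)
    violators_sufficient: (J, alpha, eta) -> set of sources of J^c violating (S_{J,eps})
    max_kernels:          maximal number Q of kernels in the active set
    """
    J = set(initial_sources)
    alpha, eta = solve_reduced(J)
    while len(J) <= max_kernels:
        violating = violators_necessary(J, alpha, eta)
        if not violating:
            violating = violators_sufficient(J, alpha, eta)
        if not violating:
            break  # both conditions hold: the duality gap is below the target
        J |= set(violating)
        alpha, eta = solve_reduced(J)
    return J, alpha, eta
```

Because only sources of the complement are ever added, the active set $J$ remains closed under taking ancestors throughout the search.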
The previous algorithm will stop either when the duality gap is less than $\varepsilon$ or when the maximal
number of kernels $Q$ has been reached. In practice, when the weights $d_v$ increase with the depth of
$v$ in the DAG (which we use in simulations), the small duality gap generally occurs before we reach
a problem larger than $Q$. Note that some of the iterations only increase the size of the active sets to
check the sufficient condition for optimality; forgetting those does not change the solution, only the
fact that we may actually know that we have an $\varepsilon$-optimal solution.

In the directed $p$-grid case, the total running time complexity is a function of the number of
observations $n$ and the number $R$ of selected kernels; with proper caching, we obtain the following
complexity, assuming $O(n^3)$ for the single kernel learning problem, which is conservative:
$O(n^3 R + n^2 R p^2 + n^2 R^2 p)$, which decomposes into solving $O(R)$ single kernel learning problems,
caching $O(Rp)$ kernels, and computing $O(R^2 p)$ quadratic forms for the sufficient conditions. Note
that the kernel search algorithm is also an efficient algorithm for unstructured MKL.



4 Consistency conditions 

As said earlier, the sparsity pattern of the solution of Eq. (1) will be equal to its hull, and thus we
can only hope to obtain consistency of the hull of the pattern, which we consider in this section.

For simplicity, we consider the case of finite-dimensional Hilbert spaces (i.e., $\mathcal{F}_v = \mathbb{R}^{f_v}$) and
the square loss. We also hold fixed the vertex set $V$, i.e., we assume that the total number of
features is fixed, and we let $n$ tend to infinity and $\lambda = \lambda_n$ decrease with $n$.

Following [4], we make the following assumptions on the underlying joint distribution of $(X, Y)$:
(a) the joint covariance matrix $\Sigma$ of $(\Phi_v(x))_{v \in V}$ (defined with appropriate blocks of size $f_v \times f_w$)
is invertible, (b) $E(Y|X) = \sum_{w \in W} \langle \beta_w, \Phi_w(x) \rangle$ with $W \subset V$ and $\mathrm{var}(Y|X) = \sigma^2 > 0$ almost
surely. With these simple assumptions, we obtain (see proof in the Appendix):

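Proposition 4 (Sufficient condition) If we have

$$\max_{t \in \mathrm{sources}(W^c)} \sum_{w \in D(t)} \frac{\big\| \Sigma_{wW} \Sigma_{WW}^{-1} \mathrm{Diag}\big( d_v \|\beta_{D(v)}\|^{-1} \big)_{v \in W} \beta_W \big\|^2}{\big( \sum_{v \in A(w) \cap D(t)} d_v \big)^2} < 1,$$

then $\beta$ and the hull of $W$ are consistently estimated when $\lambda_n n^{1/2} \to \infty$ and $\lambda_n \to 0$.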

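Proposition 5 (Necessary condition) If $\beta$ and the hull of $W$ are consistently estimated for
some sequence $\lambda_n$, then

$$\max_{t \in \mathrm{sources}(W^c)} \frac{\big\| \Sigma_{tW} \Sigma_{WW}^{-1} \mathrm{Diag}\big( d_v / \|\beta_{D(v)}\| \big)_{v \in W} \beta_W \big\|^2}{d_t^2} \leq 1.$$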

Note that the last two propositions are not consequences of the similar results for flat MKL [4],
because the groups that we consider are overlapping. Moreover, the last propositions show that we
can indeed estimate the correct hull of the sparsity pattern if the sufficient condition is satisfied. In
particular, if we can make the groups such that the between-group correlation is as small as possible,
we can ensure correct hull selection. Finally, it is worth noting that if the ratios $d_w / \max_{v \in A(w)} d_v$
tend to infinity slowly with $n$, then we always consistently estimate the depth of the hull, i.e., the
optimal interaction complexity. We are currently investigating extensions to the nonparametric
case [4], in terms of pattern selection and universal consistency.
5 Simulations

Synthetic examples We generated regression data as follows: $n = 1024$ samples of $p \in [2^2, 2^7]$
variables were generated from a random covariance matrix, and the label $y \in \mathbb{R}$ was sampled as a
random sparse fourth order polynomial of the input variables (with constant number of monomials).
We then compare the performance of our hierarchical multiple kernel learning method (HKL) with
the polynomial kernel decomposition presented in Section 2 to other methods that use the same
kernel and/or decomposition: (a) the greedy strategy of selecting basis kernels one after the other, a
procedure similar to [13], and (b) the regular polynomial kernel regularization with the full kernel
(i.e., the sum of all basis kernels). In Figure 2, we compare the two approaches on 40 replications in
the following two situations: original data (left) and rotated data (right), i.e., after the input variables
were transformed by a random rotation (in this situation, the generating polynomial is not sparse
anymore). We can see that in situations where the underlying predictor function is sparse (left),
HKL outperforms the two other methods when the total number of variables $p$ increases, while in
the other situation where the best predictor is not sparse (right), it performs only slightly better: i.e.,
in non-sparse problems, $\ell_1$-norms do not really help, but do help a lot when sparsity is expected.

UCI datasets For regression datasets, we compare HKL with polynomial (degree 4) and
Gaussian-RBF kernels (each dimension decomposed into 9 kernels) to the following approaches
with the same kernel: regular Hilbertian regularization (L2); the same greedy approach as earlier
(greedy); regularization by the $\ell_1$-norm directly on the vector $\alpha$, a strategy which is sometimes
used in the context of sparse kernel learning [14] but does not use the Hilbertian structure of the
kernel (lasso-α); and multiple kernel learning with the $p$ kernels obtained by summing all kernels
associated with a single variable, a strategy suggested by [5] (MKL). For all methods, the kernels
were held fixed, while in Table 1 we report the performance for the best regularization parameters
obtained by 10 random half splits.

We can see from Table 1 that HKL outperforms other methods, in particular for the datasets
bank-32nm, bank-32nh, pumadyn-32nm, pumadyn-32nh, which are datasets dedicated to nonlinear
regression. Note also that we efficiently explore DAGs with very large numbers of vertices $\#(V)$.
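The order of magnitude of $\#(V)$ follows directly from the grid structure. The short sketch below is ours and assumes, as the text suggests, that the degree-4 polynomial kernel contributes $q + 1 = 5$ basis kernels per variable and the Gaussian-RBF decomposition 9 per variable, so that $\#(V) = m^p$ with $m$ kernels per dimension.

```python
import math


def log10_num_vertices(kernels_per_dimension, p):
    """log10 of the number of vertices m^p of a directed grid with m kernels per dimension."""
    return p * math.log10(kernels_per_dimension)
```

For $p = 32$ variables, this gives about $10^{22}$ vertices for the polynomial decomposition and about $10^{31}$ for the Gaussian one, far beyond what could be enumerated explicitly.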

For binary classification datasets, we compare HKL (with the logistic loss) to two other methods
(L2, greedy) in Table 2. For some datasets (e.g., spambase), HKL works better, but for some
others, in particular when the generating problem is known to be non-sparse (ringnorm, twonorm),
it performs slightly worse than other approaches.

dataset        n     p   k     #(V)     L2         greedy       lasso-α     MKL        HKL
abalone        4177  10  pol4  ≈10^7    44.2±1.3   43.9±1.4     47.9±0.7    44.5±1.1   43.3±1.0
abalone        4177  10  rbf   ≈10^10   43.0±0.9   45.0±1.7     49.0±1.7    43.7±1.0   43.0±1.1
bank-32fh      8192  32  pol4  ≈10^22   40.1±0.7   39.2±0.8     41.3±0.7    38.7±0.7   38.9±0.7
bank-32fh      8192  32  rbf   ≈10^31   39.0±0.7   39.7±0.7     66.1±6.9    38.4±0.7   38.4±0.7
bank-32fm      8192  32  pol4  ≈10^22   6.0±0.1    5.0±0.2      7.0±0.2     6.1±0.3    5.1±0.1
bank-32fm      8192  32  rbf   ≈10^31   5.7±0.2    5.8±0.4      36.3±4.1    5.9±0.2    4.6±0.2
bank-32nh      8192  32  pol4  ≈10^22   44.3±1.2   46.3±1.4     45.8±0.8    46.0±1.2   43.6±1.1
bank-32nh      8192  32  rbf   ≈10^31   44.3±1.2   49.4±1.6     93.0±2.8    46.1±1.1   43.5±1.0
bank-32nm      8192  32  pol4  ≈10^22   17.2±0.6   18.2±0.8     19.5±0.4    21.0±0.7   16.8±0.6
bank-32nm      8192  32  rbf   ≈10^31   16.9±0.6   21.0±0.6     62.3±2.5    20.9±0.7   16.4±0.6
boston         506   13  pol4  ≈10^9    17.1±3.6   24.7±10.8    29.3±2.3    22.2±2.2   18.1±3.8
boston         506   13  rbf   ≈10^12   16.4±4.0   32.4±8.2     29.4±1.6    20.7±2.1   17.1±4.7
pumadyn-32fh   8192  32  pol4  ≈10^22   57.3±0.7   56.4±0.8     57.5±0.4    56.4±0.7   56.4±0.8
pumadyn-32fh   8192  32  rbf   ≈10^31   57.7±0.6   72.2±22.5    89.3±2.0    56.5±0.8   55.7±0.7
pumadyn-32fm   8192  32  pol4  ≈10^22   6.9±0.1    6.4±1.6      7.5±0.2     7.0±0.1    3.1±0.0
pumadyn-32fm   8192  32  rbf   ≈10^31   5.0±0.1    46.2±51.6    44.7±5.7    7.1±0.1    3.4±0.0
pumadyn-32nh   8192  32  pol4  ≈10^22   84.2±1.3   73.3±25.4    84.8±0.5    83.6±1.3   36.7±0.4
pumadyn-32nh   8192  32  rbf   ≈10^31   56.5±1.1   81.3±25.0    98.1±0.7    83.7±1.3   35.5±0.5
pumadyn-32nm   8192  32  pol4  ≈10^22   60.1±1.9   69.9±32.8    78.5±1.1    77.5±0.9   5.5±0.1
pumadyn-32nm   8192  32  rbf   ≈10^31   15.7±0.4   67.3±42.4    95.9±1.9    77.6±0.9   7.2±0.1

Table 1: Mean squared errors (multiplied by 100) on UCI regression datasets, normalized so that
the total variance to explain is 100. See text for details.



dataset     n     p    k     #(V)      L2         greedy     HKL
mushrooms   1024  117  pol4  ≈10^82    0.4±0.4    0.1±0.1    0.1±0.2
mushrooms   1024  117  rbf   ≈10^112   0.1±0.2    0.1±0.2    0.1±0.2
ringnorm    1024  20   pol4  ≈10^14    3.8±1.1    5.9±1.3    2.0±0.3
ringnorm    1024  20   rbf   ≈10^19    1.2±0.4    2.4±0.5    1.6±0.4
spambase    1024  57   pol4  ≈10^40    8.3±1.0    9.7±1.8    8.1±0.7
spambase    1024  57   rbf   ≈10^54    9.4±1.3    10.6±1.7   8.4±1.0
twonorm     1024  20   pol4  ≈10^14    2.9±0.5    4.7±0.5    3.2±0.6
twonorm     1024  20   rbf   ≈10^19    2.8±0.6    5.1±0.7    3.2±0.6
magic04     1024  10   pol4  ≈10^7     15.9±1.0   16.0±1.6   15.6±0.8
magic04     1024  10   rbf   ≈10^10    15.7±0.9   17.7±1.3   15.6±0.9

Table 2: Error rates (multiplied by 100) on UCI binary classification datasets. See text for details.

6 Conclusion

We have shown how to perform hierarchical multiple kernel learning (HKL) in polynomial time in
the number of selected kernels. This framework may be applied to many positive definite kernels
and we have focused on polynomial and Gaussian kernels used for nonlinear variable selection.
In particular, this paper shows that trying to use $\ell_1$-type penalties may be advantageous inside the
feature space. We are currently investigating applications to other kernels, such as the pyramid
match kernel [15], string kernels, and graph kernels [2].

A Optimization results 

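In this first section, we give proofs of all results related to the optimization problems. We first recall
precisely how we obtained the relationships between $\eta$ and $\zeta$. Using the Cauchy-Schwarz inequality,
we know that for all $\eta \in \mathbb{R}^V$ such that $\eta \geq 0$ and $\sum_{v \in V} d_v^2 \eta_v \leq 1$,

$$\Big( \sum_{v \in V} d_v \|\beta_{D(v)}\| \Big)^2 \leq \sum_{v \in V} \frac{\|\beta_{D(v)}\|^2}{\eta_v} = \sum_{w \in V} \Big( \sum_{v \in A(w)} \eta_v^{-1} \Big) \|\beta_w\|^2,$$

with equality if and only if $\eta_v = d_v^{-1} \|\beta_{D(v)}\| \big( \sum_{v \in V} d_v \|\beta_{D(v)}\| \big)^{-1}$.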



A.l Set of weights for trees 

When the DAG is a tree (i.e., when each vertex has at most one parent), then, without loss of 
generality, we may consider that only one vertex has no parent (the root $r$) while all others $w$ have 
exactly one parent $\pi(w)$. In this situation, we have for all $v \neq r$, $\zeta_{\pi(v)}^{-1} - \zeta_v^{-1} = -\eta_v^{-1}$. Moreover, 
for the root $r$, $\zeta_r = \eta_r$. This implies that the constraint $\eta \geq 0$ is equivalent to $\zeta \geq 0$ and, for all 
$v \neq r$, $\zeta_{\pi(v)} \geq \zeta_v$. The final constraint $\sum_{v \in V} \eta_v d_v^2 \leq 1$ may then be written as: 

$$\sum_{v \neq r} \frac{d_v^2}{\zeta_v^{-1} - \zeta_{\pi(v)}^{-1}} + \zeta_r d_r^2 \leq 1,$$

that is, 

$$\sum_{v \neq r} d_v^2 \Big( \zeta_v + \frac{\zeta_v^2}{\zeta_{\pi(v)} - \zeta_v} \Big) + \zeta_r d_r^2 \leq 1,$$

which is clearly convex [12]. When the DAG is not a tree, we conjecture that the set $Z$ is not convex. 
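
The change of variables from $\eta$ to $\zeta$ on a tree can be illustrated with a short numerical sketch. It assumes, as the definition $\zeta_w^{-1} = \sum_{v \in A(w)} \eta_v^{-1}$ implies, that the root satisfies $\zeta_r = \eta_r$; the four-vertex tree, the weights and the two vectors `eta_a`, `eta_b` are made-up values, and the function names are hypothetical.

```python
# Hypothetical tree: root r = 0, children 1 and 2, and vertex 3 below vertex 1.
parent = {1: 0, 2: 0, 3: 1}
nodes = [0, 1, 2, 3]                     # listed so that parents come first
d = {0: 1.0, 1: 0.5, 2: 2.0, 3: 1.5}     # made-up positive weights d_v


def zeta_from_eta(eta):
    """zeta_v = (sum of 1/eta_u over u in A(v))^(-1), computed top-down."""
    zeta = {0: eta[0]}
    for v in nodes[1:]:
        zeta[v] = 1.0 / (1.0 / zeta[parent[v]] + 1.0 / eta[v])
    return zeta


def constraint_in_eta(eta):
    """Left-hand side sum_v d_v^2 eta_v of the constraint."""
    return sum(d[v] ** 2 * eta[v] for v in nodes)


def constraint_in_zeta(zeta):
    """The same quantity rewritten in terms of zeta only."""
    total = d[0] ** 2 * zeta[0]
    for v in nodes[1:]:
        gap = zeta[parent[v]] - zeta[v]
        total += d[v] ** 2 * (zeta[v] + zeta[v] ** 2 / gap)
    return total


eta_a = {0: 0.4, 1: 0.3, 2: 0.05, 3: 0.1}
eta_b = {0: 0.1, 1: 0.9, 2: 0.02, 3: 0.2}
zeta_a, zeta_b = zeta_from_eta(eta_a), zeta_from_eta(eta_b)
zeta_mid = {v: 0.5 * (zeta_a[v] + zeta_b[v]) for v in nodes}

print(constraint_in_eta(eta_a), constraint_in_zeta(zeta_a))
print(constraint_in_zeta(zeta_mid),
      0.5 * (constraint_in_zeta(zeta_a) + constraint_in_zeta(zeta_b)))
```

The first printed pair should coincide, and the midpoint value should not exceed the average of the endpoint values, which is what convexity of the constraint in $\zeta$ predicts for this pair of points.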



A.2 Fenchel conjugates 

Following [16, 17], in order to derive optimality conditions for all losses, we need to introduce 
Fenchel conjugates. Let $\psi_i : \mathbb{R} \to \mathbb{R}$ be the Fenchel conjugate [12] of the convex function 
$\varphi_i : a \mapsto \ell(y_i, a)$, defined as 

$$\psi_i(b) = \max_{a \in \mathbb{R}} \ ab - \varphi_i(a).$$

The function $\psi_i$ is always convex and, because we have assumed that $\varphi_i$ is convex and continuous, 
we can represent $\varphi_i$ as the Fenchel conjugate of $\psi_i$, i.e., for all $a \in \mathbb{R}$, 

$$\varphi_i(a) = \max_{b \in \mathbb{R}} \ ab - \psi_i(b).$$



In particular, we have for the following standard examples: 

• for least-squares regression, we have $\varphi_i(a) = \frac{1}{2}(y_i - a)^2$ and $\psi_i(b) = \frac{1}{2} b^2 + b y_i$, 

• for logistic regression, we have $\varphi_i(a) = \log(1 + \exp(-y_i a))$, where $y_i \in \{-1, 1\}$, and 
$\psi_i(b) = (1 + b y_i) \log(1 + b y_i) - b y_i \log(-b y_i)$ if $b y_i \in [-1, 0]$, $+\infty$ otherwise, 

• for support vector machine classification, we have $\varphi_i(a) = \max(0, 1 - y_i a)$, where $y_i \in 
\{-1, 1\}$, and $\psi_i(b) = y_i b$ if $b y_i \in [-1, 0]$, $+\infty$ otherwise. 
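
The three closed-form conjugates listed above can be compared with a brute-force evaluation of $\max_a \ ab - \varphi_i(a)$. This is only an illustrative sketch: the grid bounds, the labels and the test points `b_reg`, `b_cls` are arbitrary choices, and `conjugate_on_grid` is a hypothetical helper, not a routine from the paper.

```python
import math


def conjugate_on_grid(phi, b, lo=-30.0, hi=30.0, steps=60001):
    """Brute-force approximation of max_a (a*b - phi(a)) on a uniform grid."""
    h = (hi - lo) / (steps - 1)
    return max((lo + k * h) * b - phi(lo + k * h) for k in range(steps))


def xlogx(t):
    """t*log(t), extended by continuity with 0*log(0) = 0."""
    return 0.0 if t == 0.0 else t * math.log(t)


def psi_square(b, y):
    return 0.5 * b * b + b * y


def psi_logistic(b, y):
    t = b * y
    if t < -1.0 or t > 0.0:
        return math.inf
    # (1 + t) log(1 + t) - t log(-t)
    return xlogx(1.0 + t) + xlogx(-t)


def psi_hinge(b, y):
    return y * b if -1.0 <= b * y <= 0.0 else math.inf


y_reg, b_reg = 0.7, -1.3      # made-up regression label and dual point
y_cls, b_cls = -1.0, 0.25     # made-up class label; b*y = -0.25 lies in [-1, 0]

num_square = conjugate_on_grid(lambda a: 0.5 * (y_reg - a) ** 2, b_reg)
num_logistic = conjugate_on_grid(lambda a: math.log1p(math.exp(-y_cls * a)), b_cls)
num_hinge = conjugate_on_grid(lambda a: max(0.0, 1.0 - y_cls * a), b_cls)

print(num_square, psi_square(b_reg, y_reg))
print(num_logistic, psi_logistic(b_cls, y_cls))
print(num_hinge, psi_hinge(b_cls, y_cls))
```

Each printed pair should agree up to the grid resolution; outside the interval $b y_i \in [-1, 0]$ the logistic and hinge conjugates are infinite, which the grid search cannot reproduce and which is why the test points are taken inside it.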

A.3 Preliminary propositions 

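We first recall the duality result for the regular $\ell^2$-norm kernel learning problem: 

Proposition 6 For all nonnegative $\zeta \in \mathbb{R}^V$, the dual of the optimization problem 

$$\min_{\beta \in \prod_{v \in V} \mathcal{F}_v} \ \frac{1}{n} \sum_{i=1}^n \varphi_i\Big( \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle \Big) + \frac{\lambda}{2} \sum_{w \in V} \zeta_w^{-1} \|\beta_w\|^2$$

is 

$$\max_{\alpha \in \mathbb{R}^n} \ -\frac{1}{n} \sum_{i=1}^n \psi_i(-n\lambda\alpha_i) - \frac{\lambda}{2} \alpha^\top \Big( \sum_{w \in V} \zeta_w K_w \Big) \alpha,$$

and the optimal $\beta$ can be found from an optimal $\alpha$ as $\beta_w = \zeta_w \sum_{i=1}^n \alpha_i \Phi_w(x_i)$. 

Proof We introduce auxiliary variables $u_i = \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle$ and consider the Lagrangian: 

$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^n \varphi_i(u_i) + \frac{\lambda}{2} \sum_{w \in V} \zeta_w^{-1} \|\beta_w\|^2 + \lambda \sum_{i=1}^n \alpha_i \Big( u_i - \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle \Big).$$

Minimizing with respect to the primal variables $u$, $\beta$, we get the dual problem. 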
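
For the square loss, the duality result can be verified by hand on a tiny instance. The sketch below assumes two observations, two kernels with scalar feature maps, and made-up values for `y`, `Phi`, `zeta` and `lam`; with $\psi_i(b) = \frac{1}{2}b^2 + b y_i$, setting the gradient of the dual to zero gives the linear system $(\sum_w \zeta_w K_w + n\lambda I)\alpha = y$, which is solved here by Cramer's rule.

```python
n, lam = 2, 0.1                         # made-up sample size and lambda
y = [1.0, -0.5]
Phi = {0: [1.0, 2.0], 1: [-1.0, 0.5]}   # Phi[w][i]: scalar feature Phi_w(x_i)
zeta = {0: 0.7, 1: 0.2}                 # fixed positive weights zeta_w

# K = sum_w zeta_w K_w with K_w[i][j] = Phi_w(x_i) * Phi_w(x_j)
K = [[sum(zeta[w] * Phi[w][i] * Phi[w][j] for w in Phi) for j in range(n)]
     for i in range(n)]

# Least squares: the dual is maximized where (K + n*lam*I) alpha = y.
a11, a12 = K[0][0] + n * lam, K[0][1]
a21, a22 = K[1][0], K[1][1] + n * lam
det = a11 * a22 - a12 * a21
alpha = [(y[0] * a22 - a12 * y[1]) / det, (a11 * y[1] - a21 * y[0]) / det]

# Primal solution recovered from the dual one: beta_w = zeta_w sum_i alpha_i Phi_w(x_i)
beta = {w: zeta[w] * sum(alpha[i] * Phi[w][i] for i in range(n)) for w in Phi}


def primal(b):
    """(1/n) sum_i phi_i(u_i) + (lam/2) sum_w ||b_w||^2 / zeta_w, square loss."""
    u = [sum(b[w] * Phi[w][i] for w in Phi) for i in range(n)]
    loss = sum(0.5 * (y[i] - u[i]) ** 2 for i in range(n)) / n
    return loss + 0.5 * lam * sum(b[w] ** 2 / zeta[w] for w in Phi)


def dual(a):
    """-(1/n) sum_i psi_i(-n*lam*a_i) - (lam/2) a^T K a, with psi_i(c) = c^2/2 + c*y_i."""
    conj = sum(0.5 * (n * lam * a[i]) ** 2 - n * lam * a[i] * y[i]
               for i in range(n)) / n
    quad = sum(a[i] * K[i][j] * a[j] for i in range(n) for j in range(n))
    return -conj - 0.5 * lam * quad


print(primal(beta), dual(alpha))
```

On this instance the two printed values should coincide, and perturbing `beta` or `alpha` can only increase the primal objective or decrease the dual one.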



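We will use the following simple result, which implies that each component $\zeta_w(\eta)$ is a concave 
function of $\eta$: 

Lemma 1 Let $a_1, \dots, a_m > 0$. The minimum of $\sum_{j=1}^m a_j x_j^2$ subject to $\sum_{j=1}^m x_j = 1$ is equal to 
$\big( \sum_{j=1}^m a_j^{-1} \big)^{-1}$ and is attained at $x_i = a_i^{-1} \big( \sum_{j=1}^m a_j^{-1} \big)^{-1}$. 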
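
Lemma 1 is easy to check numerically. The sketch below uses made-up positive coefficients `a` and compares the closed-form minimizer with randomly drawn feasible points; the sampling scheme is an arbitrary choice made for illustration.

```python
import random

a = [0.5, 2.0, 4.0]                          # made-up positive coefficients a_j
inv_sum = sum(1.0 / aj for aj in a)
x_star = [(1.0 / aj) / inv_sum for aj in a]  # claimed minimizer
min_value = 1.0 / inv_sum                    # claimed minimum


def objective(x):
    return sum(aj * xj * xj for aj, xj in zip(a, x))


# Random points on the constraint set sum_j x_j = 1 (positive coordinates).
rng = random.Random(0)
samples = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in a]
    total = sum(raw)
    samples.append([r / total for r in raw])
worst = min(objective(x) for x in samples)

print(min_value, objective(x_star), worst)
```

Since $\zeta_w(\eta)$ is, by the lemma, a minimum of functions that are linear in $\eta$, this is also the reason why it is concave.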



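The following proposition derives the dual of the problem in $\eta$: 

Proposition 7 Let $L = \{ \kappa \in \mathbb{R}_+^{V \times V}, \ \forall w \in V, \ \sum_{v \in A(w)} \kappa_{vw} = 1 \}$. The following optimization 
problems are dual to each other, and there is no duality gap: 

$$\min_{\kappa \in L} \ \max_{v \in V} \ d_v^{-2} \sum_{w \in D(v)} \kappa_{vw}^2 \, \alpha^\top K_w \alpha,$$

$$\max_{\eta \in H} \ \sum_{w \in V} \zeta_w(\eta) \, \alpha^\top K_w \alpha.$$

Proof We have the Lagrangian 

$$\mathcal{L} = \delta^2 + \sum_{v \in V} \eta_v \Big( \sum_{w \in D(v)} \kappa_{vw}^2 \, \alpha^\top K_w \alpha - \delta^2 d_v^2 \Big),$$

which can be minimized in closed form with respect to $\delta^2$ and $\kappa \in L$, and leads to (using Lemma 1): 

$$\min_{\kappa \in L} \ \max_{v \in V} \ d_v^{-2} \sum_{w \in D(v)} \kappa_{vw}^2 \, \alpha^\top K_w \alpha = \max_{\eta \in H} \ \alpha^\top \Big( \sum_{w \in V} \zeta_w(\eta) K_w \Big) \alpha.$$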
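
The absence of a duality gap can be observed on the smallest non-trivial DAG, a chain with two vertices, by brute-force search over both problems. This is a sketch under stated assumptions: the weights `d0`, `d1` and the values `q0`, `q1` standing for $\alpha^\top K_w \alpha$ are made up, and $\eta$ is restricted to the boundary $\sum_v d_v^2 \eta_v = 1$, which loses nothing because the dual objective is nondecreasing in $\eta$.

```python
# Chain 0 -> 1: A(0) = {0}, A(1) = {0, 1}, D(0) = {0, 1}, D(1) = {1}.
d0, d1 = 1.0, 1.0        # made-up weights d_v
q0, q1 = 0.1, 3.0        # made-up values of alpha^T K_w alpha (nonnegative)
steps = 100001
grid = [k / (steps - 1) for k in range(steps)]

# Primal: kappa_00 = 1, kappa_01 = t, kappa_11 = 1 - t with t in [0, 1].
primal = min(max((q0 + t * t * q1) / d0 ** 2,
                 (1.0 - t) ** 2 * q1 / d1 ** 2) for t in grid)


def dual_objective(s):
    """sum_w zeta_w(eta) q_w with eta0 = s/d0^2 and eta1 = (1 - s)/d1^2."""
    eta0, eta1 = s / d0 ** 2, (1.0 - s) / d1 ** 2
    zeta0 = eta0
    zeta1 = eta0 * eta1 / (eta0 + eta1)   # equals (1/eta0 + 1/eta1)^(-1)
    return zeta0 * q0 + zeta1 * q1


dual = max(dual_objective(s) for s in grid)

print(primal, dual)
```

With these values both problems can also be solved by hand (the primal at $t = 2.9/6$, the dual at $s = 3.1/6$), and both give approximately 0.8008.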



A.4 Duality gaps 

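We consider the following function of $\eta \in H$ and $\alpha \in \mathbb{R}^n$: 

$$F(\eta, \alpha) = -\frac{1}{n} \sum_{i=1}^n \psi_i(-n\lambda\alpha_i) - \frac{\lambda}{2} \alpha^\top \Big( \sum_{w \in V} \zeta_w(\eta) K_w \Big) \alpha.$$

This function is convex in $\eta$ (because of Lemma 1) and concave in $\alpha$; standard arguments (e.g., 
primal and dual strict feasibilities) show that there is no duality gap to the variational problems: 

$$\inf_{\eta \in H} \ \sup_{\alpha \in \mathbb{R}^n} F(\eta, \alpha) = \sup_{\alpha \in \mathbb{R}^n} \ \inf_{\eta \in H} F(\eta, \alpha).$$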
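We can decompose the duality gap, given a pair $(\eta, \alpha)$, as 

$$\sup_{\alpha' \in \mathbb{R}^n} F(\eta, \alpha') - \inf_{\eta' \in H} F(\eta', \alpha)$$

$$= \min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \varphi_i\Big( \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle \Big) + \frac{\lambda}{2} \sum_{w \in V} \zeta_w(\eta)^{-1} \|\beta_w\|^2 \Big\} + \frac{1}{n} \sum_{i=1}^n \psi_i(-n\lambda\alpha_i) + \sup_{\eta' \in H} \frac{\lambda}{2} \alpha^\top \sum_{w \in V} \zeta_w(\eta') K_w \alpha$$

$$\leq \frac{1}{n} \sum_{i=1}^n \varphi_i\Big( \sum_{w \in V} \zeta_w(\eta) (K_w \alpha)_i \Big) + \frac{\lambda}{2} \sum_{w \in V} \zeta_w(\eta) \alpha^\top K_w \alpha + \frac{1}{n} \sum_{i=1}^n \psi_i(-n\lambda\alpha_i) + \sup_{\eta' \in H} \frac{\lambda}{2} \alpha^\top \sum_{w \in V} \zeta_w(\eta') K_w \alpha$$

$$= \frac{1}{n} \sum_{i=1}^n \varphi_i\Big( \sum_{w \in V} \zeta_w(\eta) (K_w \alpha)_i \Big) + \frac{1}{n} \sum_{i=1}^n \psi_i(-n\lambda\alpha_i) + \lambda \sum_{w \in V} \zeta_w(\eta) \alpha^\top K_w \alpha$$

$$\quad + \sup_{\eta' \in H} \frac{\lambda}{2} \alpha^\top \sum_{w \in V} \zeta_w(\eta') K_w \alpha - \frac{\lambda}{2} \sum_{w \in V} \zeta_w(\eta) \alpha^\top K_w \alpha,$$

where the inequality is obtained by evaluating the primal objective of Proposition 6 at 
$\beta_w = \zeta_w(\eta) \sum_{i=1}^n \alpha_i \Phi_w(x_i)$. 

We thus get the desired upper bound from which Proposition 1 (of the main paper) follows, as well 
as the upper bound on the duality gap. 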



A.5 Necessary and sufficient conditions - truncated problem 

We assume that we know the optimal solution of a truncated problem where the entire set of 
descendants of some nodes has been removed. We let $J$ denote the hull of the set of active variables. 
We now consider necessary conditions and sufficient conditions for this solution to be optimal with 
respect to the full problem. This will lead to Propositions 2 and 3 of the main paper. 

We first use Proposition 7 to get a set of $\kappa_{vw}$ for $v, w \in J$ for the reduced 
problem; the goal here is to get necessary conditions by relaxing the dual problem defining $\kappa \in 
L$ and finding an approximate solution, while for the sufficient condition, any candidate leads to a 
sufficient condition. It turns out that the solution of the relaxed problem required for the 
necessary condition can also be used for the sufficient condition. 

If we assume that all variables in $J$ are indeed active, then any optimal $\kappa \in L$ must be such 
that $\kappa_{vw} = 0$ if $v \in J$ and $w \in J^c$. We then leave $\kappa_{vw}$ free for $v, w$ in $J^c$. Our goal is to find good 
candidates for those free dual parameters. 

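We first derive necessary conditions by lower-bounding the sums by maxima: 

$$\max_{v \in V \cap J^c} d_v^{-2} \sum_{w \in D(v)} \kappa_{vw}^2 \, \alpha^\top K_w \alpha \ \geq \ \max_{v \in V \cap J^c} d_v^{-2} \max_{w \in D(v)} \kappa_{vw}^2 \, \alpha^\top K_w \alpha,$$

which can be minimized in closed form with respect to $\kappa$, leading to 

$$\kappa_{vw} = d_v \Big( \sum_{v' \in A(w) \cap J^c} d_{v'} \Big)^{-1}$$

and to the lower bound 

$$\min_{\kappa \in L} \ \max_{v \in V \cap J^c} d_v^{-2} \sum_{w \in D(v)} \kappa_{vw}^2 \, \alpha^\top K_w \alpha \ \geq \ \max_{w \in J^c} \frac{\alpha^\top K_w \alpha}{\big( \sum_{v \in A(w) \cap J^c} d_v \big)^2}. \qquad (4)$$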

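For sufficient conditions, we simply take the value obtained before for $\kappa$, which leads to 

$$\min_{\kappa \in L} \ \max_{v \in V \cap J^c} d_v^{-2} \sum_{w \in D(v)} \kappa_{vw}^2 \, \alpha^\top K_w \alpha \ \leq \ \max_{v \in V \cap J^c} \sum_{w \in D(v)} \frac{\alpha^\top K_w \alpha}{\big( \sum_{v' \in A(w) \cap J^c} d_{v'} \big)^2}.$$

We have moreover, for every $t \in \mathrm{sources}(J^c)$ and every $w \in D(t)$, 

$$\sum_{v \in A(w)} d_v \ \geq \sum_{v \in A(w) \cap J^c} d_v \ \geq \sum_{v \in A(w) \cap D(t)} d_v,$$

and every $v \in J^c$ is a descendant of some $t \in \mathrm{sources}(J^c)$, so that $D(v) \subset D(t)$, 
leading to the desired upper bound 

$$\min_{\kappa \in L} \ \max_{v \in V \cap J^c} d_v^{-2} \sum_{w \in D(v)} \kappa_{vw}^2 \, \alpha^\top K_w \alpha \ \leq \ \max_{t \in \mathrm{sources}(J^c)} \sum_{w \in D(t)} \frac{\alpha^\top K_w \alpha}{\big( \sum_{v \in A(w) \cap D(t)} d_v \big)^2}.$$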

A.6 Optimality conditions for the primal formulation 

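We now derive optimality conditions for the problem in the paper, which we will need in Section B, 
i.e.: 

$$\min_{\beta \in \prod_{v \in V} \mathcal{F}_v} \ \frac{1}{n} \sum_{i=1}^n \varphi_i\Big( \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle \Big) + \frac{\lambda}{2} \Big( \sum_{v \in V} d_v \|\beta_{D(v)}\| \Big)^2.$$

Let $\beta \in \prod_{v \in V} \mathcal{F}_v$, with $J$ being the hull of the active variables. The directional derivative in the direction 
$\Delta \in \prod_{v \in V} \mathcal{F}_v$ is equal to 

$$\frac{1}{n} \sum_{i=1}^n \varphi_i'\Big( \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle \Big) \sum_{w \in V} \langle \Delta_w, \Phi_w(x_i) \rangle$$

$$\quad + \lambda \Big( \sum_{v \in V} d_v \|\beta_{D(v)}\| \Big) \Big( \sum_{v \in J} d_v \frac{\langle \beta_{D(v) \cap J}, \Delta_{D(v) \cap J} \rangle}{\|\beta_{D(v) \cap J}\|} + \sum_{v \in J^c} d_v \|\Delta_{D(v)}\| \Big),$$

and thus $\beta$ is optimal if and only if we have, with $\delta = \sum_{v \in J} d_v \|\beta_{D(v) \cap J}\|$, 

$$\forall w \in J, \quad \frac{1}{n} \sum_{i=1}^n \varphi_i'\Big( \sum_{v \in J} \langle \beta_v, \Phi_v(x_i) \rangle \Big) \Phi_w(x_i) + \lambda \delta \Big( \sum_{v \in A(w)} \frac{d_v}{\|\beta_{D(v)}\|} \Big) \beta_w = 0,$$

$$\forall \Delta_{J^c}, \quad \frac{1}{n} \sum_{i=1}^n \varphi_i'\Big( \sum_{v \in J} \langle \beta_v, \Phi_v(x_i) \rangle \Big) \sum_{w \in J^c} \langle \Delta_w, \Phi_w(x_i) \rangle + \lambda \delta \sum_{v \in J^c} d_v \|\Delta_{D(v)}\| \geq 0.$$

Note that when regularizing by $\lambda \sum_{v \in V} d_v \|\beta_{D(v)}\|$ instead of $\frac{\lambda}{2} \big( \sum_{v \in V} d_v \|\beta_{D(v)}\| \big)^2$, we have 
the same optimality condition with $\delta = 1$. 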
B Consistency conditions 



We assume that we are in the finite dimensional setting (i.e., each $\mathcal{F}_v$ has finite dimension $f_v$) with 
the square loss. For $w \in V$, we let $G_w \in \mathbb{R}^{n \times f_w}$ denote the matrix whose $i$-th row is $\Phi_w(x_i)$. We 
let $\Sigma_{vw} \in \mathbb{R}^{f_v \times f_w}$ denote the population covariance between $\Phi_v(x)$ and $\Phi_w(x)$. The full covariance 
matrix, defined from the blocks $\Sigma_{vw}$, is assumed invertible. With these assumptions, we can follow 
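the approach of [18, 19, 20]: that is, if $\lambda_n$ tends to zero slower than $n^{-1/2}$, then the estimate $\hat{\beta}$ 
converges in probability to the generating $\beta$, and we have the expansion $\hat{\beta} = \beta + \lambda_n \gamma$, where $\gamma$ is 
the solution of the following optimization problem, with $\delta = \sum_{v \in W} d_v \|\beta_{D(v)}\|$: 

$$\min_{\gamma} \ \frac{1}{2} \gamma^\top \Sigma \gamma + \delta \Big( \sum_{v \in W} d_v \frac{\beta_{D(v) \cap W}^\top \gamma_{D(v) \cap W}}{\|\beta_{D(v) \cap W}\|} + \sum_{v \in W^c} d_v \|\gamma_{D(v)}\| \Big).$$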

The consistency condition is then obtained by studying when the first order expansion indeed has 
the correct sparsity pattern (for more precise statements and arguments, see [19]). We let 
$\gamma_W$ denote the solution of the previous problem, restricted to $\gamma_{W^c} = 0$. We have: 

$$\gamma_W = -\delta \, \Sigma_{WW}^{-1} \, \mathrm{Diag}\Big( \sum_{v \in A(w)} d_v \|\beta_{D(v)}\|^{-1} \Big) \beta_W.$$

Following the previous section, it is optimal if and only if, for all $\Delta_{W^c}$, 

$$\Delta_{W^c}^\top \Sigma_{W^c W} \gamma_W + \delta \sum_{v \in W^c} d_v \|\Delta_{D(v)}\| \geq 0.$$

We let 

$$\Lambda_{W^c} = \delta^{-1} \Sigma_{W^c W} \gamma_W = -\Sigma_{W^c W} \Sigma_{WW}^{-1} \, \mathrm{Diag}\Big( \sum_{v \in A(w)} d_v \|\beta_{D(v)}\|^{-1} \Big) \beta_W.$$

The condition for good pattern selection is that for all $\Delta_{W^c}$, 

$$\Lambda_{W^c}^\top \Delta_{W^c} + \sum_{v \in W^c} d_v \|\Delta_{D(v)}\| \geq 0,$$

which is exactly equivalent to $\|\Lambda_{W^c}\|^* \leq 1$, where $x \mapsto \|x\|^*$ is the dual norm of the norm $\Delta \mapsto 
\sum_{v \in W^c} d_v \|\Delta_{D(v)}\|$. This dual norm may be computed in closed form in the unstructured case, 
where $D(v) = \{v\}$, and is equal to the $\ell^\infty$-norm. In general, it cannot be computed in closed form. 
However, we can give the following lower and upper bounds that lead to the desired propositions of 
the main paper. 
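
In the unstructured case $D(v) = \{v\}$, the norm is $\sum_v d_v \|\Delta_v\|$ and its dual norm is the weighted block maximum $\max_v \|x_v\| / d_v$. The following sketch checks this closed form on made-up data; the weights `d`, the blocks `x` and the helper names are illustration values and not quantities from the paper.

```python
import math
import random

d = [1.0, 2.0, 0.5]                              # made-up weights d_v
x = [[0.3, -0.4], [1.5, 2.0], [0.1, 0.0]]        # made-up blocks x_v


def norm2(block):
    return math.sqrt(sum(c * c for c in block))


def omega(delta):
    """Unstructured block norm: sum_v d_v ||delta_v||."""
    return sum(dv * norm2(b) for dv, b in zip(d, delta))


def inner(u, v):
    return sum(a * b for bu, bv in zip(u, v) for a, b in zip(bu, bv))


ratios = [norm2(xv) / dv for xv, dv in zip(x, d)]
dual_norm = max(ratios)                          # max_v ||x_v|| / d_v

# A direction attaining the supremum: only the maximizing block is nonzero.
best = ratios.index(dual_norm)
delta_star = [[0.0] * len(xv) for xv in x]
scale = d[best] * norm2(x[best])
delta_star[best] = [c / scale for c in x[best]]

# Random directions never exceed the closed-form value.
rng = random.Random(0)
sampled = 0.0
for _ in range(2000):
    delta = [[rng.uniform(-1, 1) for _ in xv] for xv in x]
    sampled = max(sampled, inner(x, delta) / omega(delta))

print(dual_norm, inner(x, delta_star), sampled)
```

The direction `delta_star` has unit norm and attains the dual norm exactly, while random directions only approach it from below.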
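We have: 

$$\sum_{v \in W^c} d_v \|\Delta_{D(v)}\| \ \leq \sum_{v \in W^c} d_v \sum_{w \in D(v)} \|\Delta_w\| = \sum_{w \in W^c} \Big( \sum_{v \in A(w) \cap W^c} d_v \Big) \|\Delta_w\|,$$

which leads to the lower bound 

$$\|x\|^* \geq \max_{w \in W^c} \frac{\|x_w\|}{\sum_{v \in A(w) \cap W^c} d_v}.$$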
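Moreover, we have: 

$$\Big( \sum_{v \in W^c} d_v \|\Delta_{D(v)}\| \Big)^2 = \sum_{v, v' \in W^c} d_v d_{v'} \|\Delta_{D(v)}\| \, \|\Delta_{D(v')}\| \ \geq \sum_{v, v' \in W^c} d_v d_{v'} \|\Delta_{D(v) \cap D(v')}\|^2$$

$$= \sum_{w \in W^c} \|\Delta_w\|^2 \sum_{v \in A(w) \cap W^c} \ \sum_{v' \in A(w) \cap W^c} d_v d_{v'},$$

which leads to the upper bound: 

$$\big( \|x\|^* \big)^2 \leq \sum_{w \in W^c} \frac{\|x_w\|^2}{\big( \sum_{v \in A(w) \cap W^c} d_v \big)^2}.$$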